Robust Automatic Speech Recognition in the Presence of Impulsive Noise
Authors
Abstract
This letter presents a technique for suppressing the effect of impulsive noise of any type in Automatic Speech Recognition (ASR). We propose inserting two stages into the Mel Frequency Cepstral Coefficient (MFCC) feature-extraction front-end: a) an identification stage that detects corrupted cepstral vectors and b) a restoration stage that acts locally on the degraded cepstrum.

Introduction: ASR systems consist of a feature-preprocessing stage, which aims to extract the linguistic message while suppressing non-linguistic sources of variability, and a classification stage (including language modelling), which maps the feature vectors to linguistic classes. The feature-extraction stage of current ASR systems converts the input speech signal into a series of low-dimensional vectors, each summarizing the temporal and spectral behaviour of a short segment of the acoustic input. The ultimate goal is to estimate statistics sufficient to discriminate among different phonetic units while minimizing the computational demands of the classifier. Although ASR has matured to the point of commercial products, operational systems still struggle to maintain high recognition performance in adverse environments. The degradation in recognition performance is typically attributed to the mismatch between training and testing conditions. Robust ASR methods include signal-enhancement techniques applied as a front-end and/or feature-space transformations that reduce the variability due to noise [1]. Their inherent disadvantage is that they rely on assumptions about the nature of the noise that often do not hold. Our framework deals with a noise type that is difficult to model: short-time bursts of random amplitude, spectral content, onset time and frequency of occurrence. It can handle heavily distorted speech in which the distortion extends over several hundred samples.
The noise suppression scheme (Fig. 1) avoids acting globally on the original waveform; that is, it does not distort parts of the spectrum that are already clean. A two-layer time-delay back-propagation neural network (TDNN) is trained on clean data under matched training and operational conditions. The TDNN is trained to predict the next cepstral vector from the current one and the 7 previous ones. After convergence, the absolute value of the difference between the predicted and the actual feature is calculated for every cepstral vector (one per frame) of a validation set consisting of recordings not included in the training set. The 99% upper control limit is derived from the prediction error on the validation set (the lower limit is zero) using the MATLAB Statistics Toolbox. The control limit is the 99% confidence interval on a new observation from the process. In operational use, if the mean square error between the predicted and the actual vector falls within the control limits, the feature is classified as clean and is left intact. For a degraded cepstral vector, the prediction error violates these limits; the vector is therefore classified as impaired and is replaced by the predicted cepstral vector (see Fig. 2). Outliers can be identified because the TDNN is trained on data free of any impulsive corruption, as indicated by the annotated transcription, and because neural networks, although able to generalize well inside the space of the training data, cannot extrapolate outside it. In practical tests, the prediction error of an impulsively degraded cepstral vector is almost one order of magnitude higher than the normal error.

Description of the system: The appeal of TDNNs for identifying cepstral outliers lies in their well-known ability to incorporate context information into the prediction procedure.
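The detection and restoration rule described above (flag a frame when its prediction error exceeds the upper control limit, then substitute the prediction) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the control limit is taken as the empirical 99% quantile of the validation errors rather than computed with the MATLAB Statistics Toolbox, `predict` is a hypothetical stand-in for the trained TDNN, and the names `control_limit` and `restore` are illustrative.

```python
import numpy as np

CONTEXT = 8  # current frame plus the 7 previous ones


def control_limit(val_errors, level=0.99):
    """Upper control limit, here the empirical quantile of validation errors (assumption)."""
    return float(np.quantile(np.asarray(val_errors, dtype=float), level))


def restore(cepstra, predict, limit, context=CONTEXT):
    """Replace frames whose prediction error exceeds `limit` by the prediction.

    `predict` maps a (context, n_coeffs) window to the next cepstral vector;
    it is a hypothetical stand-in for the trained TDNN.
    """
    restored = np.array(cepstra, dtype=float)
    flagged = np.zeros(len(restored), dtype=bool)
    for t in range(context, len(restored)):
        # The window may contain frames that were themselves restored earlier.
        pred = predict(restored[t - context:t])
        err = np.mean((pred - restored[t]) ** 2)
        if err > limit:
            restored[t] = pred
            flagged[t] = True
    return restored, flagged


# Toy demonstration: a flat cepstral track with one impulsive frame.
frames = np.ones((30, 13))
frames[20] += 10.0
mean_predictor = lambda window: window.mean(axis=0)  # NOT a TDNN, demo only
clean, hits = restore(frames, mean_predictor, limit=0.5)
print(np.flatnonzero(hits))
```

In this toy run only frame 20 is flagged and replaced. Because the window is read from the already restored sequence, a long run of flagged frames would be predicted mostly from earlier predictions, and the first `CONTEXT` frames are never checked in this sketch.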
This network topology has been proved capable of approximating any function with a finite number of discontinuities with, in theory, arbitrary accuracy [2]. Its mapping is expressed by:

$$\hat{C}_{T+1} = f(C_T, C_{T-1}, \ldots, C_{T-7}) \qquad (1)$$

where $C_T = [C_1, \ldots, C_{13}]$ denotes the vector of 12 static cepstral coefficients plus a log-energy coefficient, and $T$ is the time frame index. The hidden layer has 300 neurons with hyperbolic tangent activation functions, and the output layer has 13 nodes with linear activation functions. The weights are adjusted with resilient backpropagation using the mean-square-error cost function. The TDNN training set consisted of the cepstral vectors of 2000 clean, phonetically balanced recordings from 200 male and female speakers; a further 1000 speech files formed the validation set. The data were mean- and variance-normalized per recording. The processed coefficients were subsequently expanded by appending first- and second-order derivatives to the static vector.

For the reference recognition system we used the HTK Toolkit. The basic recognition units were 5-state, tied-state, context-dependent triphones. The speaker-independent continuous speech recognizer was trained on 3000 speakers from the SpeechDat database. For the evaluation experiments we used recordings each containing a sequence of 6 digits, so the task can be regarded as speaker-independent with a small vocabulary of ten possible digits. In operational use, we produced the control chart (Fig. 2) of the prediction error. A control chart is a plot of the prediction-error measurements over time with statistical limits applied. The impulsive noise suppression module is activated once the prediction error violates the control limits derived from the validation set.

Simulation and results: For SNR measurements in the case of impulsive noise we follow the notation of [3].
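The mapping of Eq. (1) and the layer sizes given for the TDNN can be sketched in Python with NumPy. The sketch only shows how 8 consecutive 13-dimensional vectors form a 104-dimensional input that passes through a 300-unit tanh layer and a 13-unit linear layer. The random weights, the oldest-first ordering inside each window, and the names `make_windows`, `init_params` and `forward` are assumptions made for illustration; training with resilient backpropagation is not shown.

```python
import numpy as np

N_COEFFS = 13   # 12 static cepstral coefficients + log-energy
CONTEXT = 8     # C_T and the 7 previous vectors
N_HIDDEN = 300  # tanh hidden units


def make_windows(cepstra, context=CONTEXT):
    """Return inputs (stacked C_{T-7}..C_T, oldest first) and targets (C_{T+1})."""
    cepstra = np.asarray(cepstra, dtype=float)
    n = len(cepstra) - context
    inputs = np.stack([cepstra[i:i + context].ravel() for i in range(n)])
    targets = cepstra[context:]
    return inputs, targets


def init_params(rng, scale=0.01):
    """Small random weights; placeholders for trained parameters."""
    return {
        "w1": rng.normal(0.0, scale, (CONTEXT * N_COEFFS, N_HIDDEN)),
        "b1": np.zeros(N_HIDDEN),
        "w2": rng.normal(0.0, scale, (N_HIDDEN, N_COEFFS)),
        "b2": np.zeros(N_COEFFS),
    }


def forward(params, inputs):
    """Hidden tanh layer followed by a linear output layer."""
    hidden = np.tanh(inputs @ params["w1"] + params["b1"])
    return hidden @ params["w2"] + params["b2"]


rng = np.random.default_rng(0)
cepstra = rng.normal(size=(20, N_COEFFS))  # synthetic stand-in for real MFCCs
x, y = make_windows(cepstra)
pred = forward(init_params(rng), x)
mse = np.mean((pred - y) ** 2)  # the cost a training loop would minimize
print(x.shape, y.shape, pred.shape)
```

With 20 synthetic frames the sketch yields 12 input windows of 104 values each, and one 13-dimensional prediction per window.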
Let $P_{im}$ denote the average power of each impulse and $P_{sig}$ the signal power. For impulsive noise, the average signal-to-noise ratio depends on the average power of each impulse and on the rate at which noisy pulses occur. With $\alpha$ denoting the fraction of signal samples contaminated by impulsive noise, we define the average signal-to-impulsive-noise ratio as:

$$\mathrm{SINR}(\mathrm{dB}) = 10 \log_{10}\left(\frac{P_{sig}}{\alpha P_{im}}\right) \qquad (2)$$

Machine-gun noise from the NOISEX-92 database, with varying frequency of occurrence, amplitude and onset time, was added to six-digit recordings free of any impulsive noise so that each SINR value shown in Fig. 3 was reached. The noisy files and their enhanced versions were then passed to the recognizer. Fig. 3 depicts the average word recognition results, which show a clear and considerable gain when the enhancement procedure is included in the MFCC front-end. A closer examination of the word recognition results reveals that recognition performance degrades mainly because of a high number of impulse occurrences rather than because of high-energy pulses. When impulses are rare, performance is very high. The system fails only for large clusters of adjacent, high-energy disturbances, where the prediction has to rely on a large number of previously restored feature vectors.

Conclusions: We wish to emphasize the practical utility of our approach, which does not seek to replace but rather enhances the discriminative ability of the existing and successful MFCC front-end. We have not exploited any specific characteristics of the machine-gun noise, since our framework is based on identifying cepstral vectors that are not drawn from the underlying cepstrum of the speech signal and treating them as outliers.
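Eq. (2) and the noise-mixing step of the experiments can be illustrated numerically. The sketch below is one assumed way of reaching a target SINR, not the authors' procedure: impulse power is measured only over the contaminated samples, a single gain is applied to the noise, synthetic Gaussian data replace the speech and NOISEX-92 recordings, and `sinr_db` and `impulse_gain` are hypothetical helper names.

```python
import numpy as np


def sinr_db(p_sig, p_im, alpha):
    """Eq. (2): average signal-to-impulsive-noise ratio in dB."""
    return 10.0 * np.log10(p_sig / (alpha * p_im))


def impulse_gain(signal, noise, mask, target_db):
    """Gain for the noise bursts so that the mixture reaches `target_db`."""
    p_sig = np.mean(signal ** 2)
    p_im = np.mean(noise[mask] ** 2)   # power over contaminated samples only
    alpha = np.mean(mask)              # fraction of contaminated samples
    return np.sqrt(p_sig / (alpha * p_im * 10.0 ** (target_db / 10.0)))


rng = np.random.default_rng(1)
signal = rng.normal(size=16000)        # synthetic stand-in for clean speech
noise = rng.normal(size=16000)         # synthetic stand-in for impulsive noise
mask = np.zeros(16000, dtype=bool)
mask[4000:4400] = True                 # two bursts of 400 samples each
mask[9000:9400] = True

gain = impulse_gain(signal, noise, mask, target_db=5.0)
noisy = signal + gain * noise * mask   # noise is added only inside the bursts
achieved = sinr_db(np.mean(signal ** 2),
                   np.mean((gain * noise[mask]) ** 2),
                   np.mean(mask))
print(round(float(achieved), 3))
```

Since $\alpha$ multiplies the impulse power, Eq. (2) gives the same SINR for rare strong bursts as for frequent weak ones. This helps explain why the results distinguish the number of occurrences from the energy of the pulses.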
Because our framework operates frame by frame on the cepstral vectors already extracted for recognition, outlier identification and restoration add little to no computational overhead, which allows real-time, on-line operation.

References:
[1] GONG Y., “Speech recognition in noisy environments: A survey,” Speech Communication, 1995, 16, pp. 261-291.
[2] HAYKIN S., “Neural Networks: A Comprehensive Foundation,” MacMillan, 1994.
[3] VASEGHI S., “Advanced Signal Processing and Digital Noise Reduction,” Wiley Teubner, 1996.

Authors’ Affiliations: I. Potamitis, N. Fakotakis, G. Kokkinakis, Wire Communications Laboratory, Electrical and Computer Engineering Dept., University of Patras, 261 10 Rion, Patras, Greece, Tel: +30 61 991722, Fax: +30 61 991855, e-mail: [email protected]

Figure captions:
Fig. 1: Block diagram of the impulsive noise removal front-end.
Fig. 2: a) Spectrogram of a speech signal corrupted by two pulses (frames 39 and 60). b) Control chart of the prediction error.
Fig. 3: Word recognition accuracy (%) under SINR ranging from -10 to +20 dB. a) No enhancement applied. b) Impulsive noise suppression applied.
Similar resources
Improving the performance of MFCC for Persian robust speech recognition
The Mel Frequency cepstral coefficients are the most widely used feature in speech recognition but they are very sensitive to noise. In this paper to achieve a satisfactorily performance in Automatic Speech Recognition (ASR) applications we introduce a noise robust new set of MFCC vector estimated through following steps. First, spectral mean normalization is a pre-processing which applies to t...
A Robust Distributed Estimation Algorithm under Alpha-Stable Noise Condition
Robust adaptive estimation of unknown parameter has been an important issue in recent years for reliable operation in the distributed networks. The conventional adaptive estimation algorithms that rely on mean square error (MSE) criterion exhibit good performance in the presence of Gaussian noise, but their performance drastically decreases under impulsive noise. In this paper, we propose a rob...
Soft decision strategy and adaptive compensation for robust speech recognition against impulsive noise
This paper presents research on robust automatic speech recognition (ASR) in the presence of impulsive noise, which is usually caused by transmission errors or packet loss in network-based delivery of speech signals. A soft decision strategy is proposed by analyzing the degraded observation probabilities caused by impulsive noise. Based on the soft decision results, two compensation methods are...
Designing and implementing a system for Automatic recognition of Persian letters by Lip-reading using image processing methods
For many years, speech has been the most natural and efficient means of information exchange for human beings. With the advancement of technology and the prevalence of computer usage, the design and production of speech recognition systems have been considered by researchers. Among this, lip-reading techniques encountered with many challenges for speech recognition, that one of the challenges b...
Speech Emotion Recognition Based on Power Normalized Cepstral Coefficients in Noisy Conditions
Automatic recognition of speech emotional states in noisy conditions has become an important research topic in the emotional speech recognition area, in recent years. This paper considers the recognition of emotional states via speech in real environments. For this task, we employ the power normalized cepstral coefficients (PNCC) in a speech emotion recognition system. We investigate its perfor...
A new method for robust speech recognition based on missing data using a bidirectional neural network
Performance of speech recognition systems is greatly reduced when speech corrupted by noise. One common method for robust speech recognition systems is missing feature methods. In this way, the components in time - frequency representation of signal (Spectrogram) that present low signal to noise ratio (SNR), are tagged as missing and deleted then replaced by remained components and statistical ...